Analysis of Interactions Between Categorical Variables

by 

S.  Kullback  and  P.  N.  Reeves 
The  George  Washington  University 

The principle of minimum discrimination information
estimation and associated procedures are applied to data
from a survey of hospitals to determine the relationship
of innovativeness to certain hospital characteristics.

Introduction 

The literature in the health field contains many examples
of cases where a researcher has been unable to deal with
interaction between variables even though it is generally
believed that such interaction does exist. The inability to
deal with these interactions stems from the fact that data
in the health field often do not meet the assumptions for use
of classical parametric procedures. Until recently none of
the non-parametric techniques available could cope with this
kind of situation.

A recent article, Johnson et al. [2], contains a brief
list of various techniques now available and gives a detailed
description of one particular method for dealing with certain
types of this sort of data. Among the approaches listed
but not described in detail in [2] is the minimum
discrimination information methodology reported in [3].
Subsequent research has extended the minimum discrimination
information methodology [4], [6], and enhanced its usefulness
for health research.

The purpose of this article is two-fold: first, to
point out that the reservations noted in [2] do not restrict
the usefulness of this approach; second, to illustrate the
technique using health data so that researchers in the health
field can become aware of this valuable tool and add it to
their armamentarium.

"Multivariate data analysis needs a large and flexible
class of hypothetical distributions of free variables indexed
by the values of fixed variables. From this class, appropriate
subfamilies would be chosen for fitting to specific data
sets" [1]. The principle of minimum discrimination information
estimation and its basis, the minimum discrimination information
theorem, which is quite general in its formulation, lead to
exponential families of distributions [4], [5], [6]. The
exponential families have very useful and desirable statistical
properties and contain many subfamilies in common use [1].

"The data analytic attitude to models is empirical rather than
theoretical. ... When detailed theoretical understanding is
unavailable, a more empirical attitude is natural, so that
estimation of parameters in models should be seen less as
attempts to discover underlying truth and more as data
calibrating devices which make it easier to conceive of noisy
data in terms of smooth distributions and relations.
Exponential families are viewed here as intended for use in the
empirical mode. With a given data set, a variety of models
may be tried on, and one selected on the grounds of looks and
fit" [1]. When the minimum discrimination information estimates
provide a satisfactory fit to a set of data, a complete analysis,
including significance tests and estimates describing the
pattern of observations, is provided.

We propose to present an example of the use of the principle
of minimum discrimination information estimation, its related
statistics, exponential family and analysis of information, in
relation to a matter of concern to health administrators. The
data used are from the field of hospital administration and
relate to the matter of innovation in hospitals. We begin with
the assumption that the use of electronic data processing (EDP)
in hospitals in the late 1960's was innovative. This assumption
is substantiated by a variety of surveys of the use of EDP in
hospitals. (See the Jacobs, Reeves, and Hammond article to be
published in Hospitals.) On this basis the data in a survey of
hospitals using EDP conducted by Herner and Co. were combined
with data from the Guide Issue of Hospitals for the same period
so that a file of records reflecting characteristics of
hospitals and levels at which EDP was used by these hospitals
was created. The hospitals in this survey were selected by
stratified sampling. The stratification (fixed variable) was
on the basis of hospital size. All hospitals in the large-size
category (200 or more beds) were included in the survey
and a ten percent sample was taken of those in the small-size
category. The data from these files were tabulated and
arranged in multiway contingency tables. The analysis of the
tables for the large and small hospitals will be described
here and interrelated to illustrate the use of the minimum
discrimination information estimation technique. Computer
programs have been prepared and are available to provide the
necessary output for the analysis.

On the basis of these analyses we conclude that there is
a distinct relation of innovation to location and length of
stay, with a common factor for large and small hospitals. The
association (measured by the logarithm of the cross-product
ratio) between use of EDP and length of stay is the same for
the large and small hospitals. The log-odds (logits) of use
of EDP, arranged in descending order of magnitude within the
large hospitals and within the small hospitals, are parallel
in terms of the combinations of the factors location and
length of stay. The usage of EDP is generally greater in the
large hospitals than in the small hospitals, except that the
best log-odds for the small hospitals is greater than the
poorest log-odds for the large hospitals.

Hospital  Characteristics  Associated  With  Use  of  EDP 

In a study to identify characteristics which distinguish
hospitals which use EDP from those which do not, that is, to
identify characteristics which are significantly associated
with use of EDP, data on 1176 hospitals, 923 large and 253
small, were collected with respect to use, location, and length
of stay. The data appear in the two three-way 2x2x2
contingency tables, Tables 1 and 2. In order to determine the
relation among the free variables use, location, and length of
stay, indexed by size of hospital, and the interactions that may
exist among these characteristics, it seems intuitively clear
that an analysis based only on two-way tables would not
suffice. We shall analyze the data using the principle of
minimum discrimination information estimation and its associated
statistics, as presented and discussed in [4], [6].

We shall denote the occurrences in the observed Tables
1 and 2 respectively by x(ijk), y(ijk) with

i=1, user; i=2, non-user
j=1, urban; j=2, rural
k=1, short; k=2, long.

The proposed procedure provides estimates for the original
data analogous to a regression procedure using sets of
observed marginals as explanatory variables, and we shall try
to find an estimate which does not differ significantly
from the observed data. The set of acceptable estimates
will indicate the nature of the significant interactions
for which we can compute numerical measures.

As a first step in the analysis we shall find "smoothed"
estimates of the original data. We shall do this for the
large hospitals also, even though data were collected for all
large hospitals. We examine the minimum discrimination
information estimates obtained by a convergent iterative
algorithm starting with a uniform table and successively
adjusting for sets of observed marginals. It turns out that
the set of two-way marginals is best and the resultant
estimates provide a satisfactory fit. The estimated tables
have the same two-way and also the same one-way marginals as
the original tables [4], [6]. These estimates, which we
denote by $x_2^*(ijk)$, $y_2^*(ijk)$ respectively for the large
and small hospitals, are given in Tables 3 and 4 and imply no
second-order (three-factor) interaction. Note that the estimate
for the observed y(122)=0 is $y_2^*(122)=0.137$.
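The iterative fitting described above can be sketched in a few lines of Python. This is a minimal illustration and not the authors' program: the names `margin` and `fit_margins`, the stopping tolerance, and the cycle limit are all assumptions made for the example. The counts are the small-hospital data of Table 2, keyed by (i, j, k) as defined earlier.

```python
from itertools import product

# Observed small-hospital counts y(ijk) from Table 2, keyed by (i, j, k):
# i = 1 user, 2 non-user; j = 1 urban, 2 rural; k = 1 short, 2 long.
y = {(1, 1, 1): 28, (1, 1, 2): 2, (1, 2, 1): 11, (1, 2, 2): 0,
     (2, 1, 1): 80, (2, 1, 2): 14, (2, 2, 1): 114, (2, 2, 2): 4}

def margin(table, axes):
    """Sum a table over every index position not listed in `axes`."""
    out = {}
    for cell, value in table.items():
        key = tuple(cell[a] for a in axes)
        out[key] = out.get(key, 0.0) + value
    return out

def fit_margins(observed, margins, tol=1e-10, max_cycles=10000):
    """Start from a uniform table and rescale to each observed marginal in turn."""
    total = sum(observed.values())
    est = {cell: total / len(observed) for cell in observed}
    targets = [margin(observed, axes) for axes in margins]
    for _ in range(max_cycles):
        largest_change = 0.0
        for axes, target in zip(margins, targets):
            current = margin(est, axes)
            for cell in est:
                key = tuple(cell[a] for a in axes)
                new = est[cell] * target[key] / current[key]
                largest_change = max(largest_change, abs(new - est[cell]))
                est[cell] = new
        if largest_change < tol:
            break
    return est

# Fit the three two-way marginals y(ij.), y(i.k), y(.jk).
y_star = fit_margins(y, [(0, 1), (0, 2), (1, 2)])
for cell in sorted(y_star):
    print(cell, round(y_star[cell], 3))
```

With these counts the fitted value for the empty cell (1,2,2) should come out close to the 0.137 quoted in the text, while every two-way marginal of the fitted table agrees with the observed one.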

The estimates are given analytically by the log-linear
representation of an exponential family

$$
\ln \frac{x_2^*(ijk)}{n\pi(ijk)} = L + \tau_1^i T_1^i(ijk) + \tau_1^j T_1^j(ijk) + \tau_1^k T_1^k(ijk) + \tau_{11}^{ij} T_{11}^{ij}(ijk) + \tau_{11}^{ik} T_{11}^{ik}(ijk) + \tau_{11}^{jk} T_{11}^{jk}(ijk) \qquad (1)
$$

where $n = \sum_i \sum_j \sum_k x(ijk)$, $\pi(ijk) = 1/(2 \cdot 2 \cdot 2)$,
L is a normalizing constant, the taus are main-effect and
interaction parameters, and the T(ijk) are a set of linearly
independent random variables, in this case the indicator
functions of the respective marginals.

A similar representation holds for $y_2^*(ijk)$. The log-linear
representations are shown graphically in Figure 1 [6]. The
values in the various columns of Figure 1, zeros or ones, are
the values of the respective functions T(ijk). Note that
$T_{11}^{ij}(ijk) = T_1^i(ijk)\,T_1^j(ijk)$,
$T_{11}^{ik}(ijk) = T_1^i(ijk)\,T_1^k(ijk)$,
$T_{11}^{jk}(ijk) = T_1^j(ijk)\,T_1^k(ijk)$.
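The zero-one columns of such a representation can be generated mechanically. The sketch below is illustrative only: it assumes that each indicator equals 1 at the first level of its variable (user, urban, short) and 0 otherwise, a coding chosen because it reproduces the logit expressions used later in the text. The function name `indicator_rows` and the dictionary keys are hypothetical.

```python
from itertools import product

def indicator_rows():
    """Return the 0/1 values of the T functions for each cell of a 2x2x2 table."""
    rows = {}
    for i, j, k in product((1, 2), repeat=3):
        # Assumed coding: the indicator is 1 at level 1 and 0 at level 2.
        ti, tj, tk = int(i == 1), int(j == 1), int(k == 1)
        rows[(i, j, k)] = {
            "L": 1,            # column for the normalizing constant
            "i": ti, "j": tj, "k": tk,
            "ij": ti * tj,     # interaction columns are products of main-effect columns
            "ik": ti * tk,
            "jk": tj * tk,
        }
    return rows

rows = indicator_rows()
for cell in sorted(rows):
    print(cell, list(rows[cell].values()))
```

Under this coding the cell (2,2,2) carries only the constant, so every tau parameter measures a departure from the non-user, rural, long-stay cell.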

To test the goodness-of-fit of the estimates we compute
the statistics [4], [6],

$$
2I(x : x_2^*) = 2 \sum_i \sum_j \sum_k x(ijk) \ln \frac{x(ijk)}{x_2^*(ijk)} = 0.481, \quad 1 \text{ D.F.}
$$

$$
2I(y : y_2^*) = 2 \sum_i \sum_j \sum_k y(ijk) \ln \frac{y(ijk)}{y_2^*(ijk)} = 0.294, \quad 1 \text{ D.F.}
$$

Since the statistics are asymptotically distributed as $\chi^2$,
we conclude that the "smoothed" values $x_2^*$, $y_2^*$ are good
estimates and we shall use them in our subsequent analysis.
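As a rough check, the statistic for the small hospitals can be recomputed from the tabulated values. The sketch below is an illustration under stated assumptions: the function name `mdi_statistic` is hypothetical, a zero observed count is taken to contribute nothing to the sum, the fitted value for cell (122) is the 0.137 quoted in the text, and the tail probability uses the standard identity for a chi-square variable with one degree of freedom.

```python
import math

def mdi_statistic(observed, fitted):
    """2I(x : x*) = 2 * sum of x ln(x / x*), with 0 ln 0 taken as 0."""
    return 2.0 * sum(o * math.log(o / f)
                     for o, f in zip(observed, fitted) if o > 0)

# Small hospitals, cells in the order 111, 112, 121, 122, 211, 212, 221, 222.
y_obs = [28, 2, 11, 0, 80, 14, 114, 4]                                  # Table 2
y_fit = [28.137, 1.863, 10.863, 0.137, 79.863, 14.137, 114.137, 3.863]  # Table 4

stat = mdi_statistic(y_obs, y_fit)

# For 1 degree of freedom, P(chi-square > x) = erfc(sqrt(x / 2)).
p_value = math.erfc(math.sqrt(stat / 2.0))
print(round(stat, 3), round(p_value, 2))
```

Because the fitted values are rounded to three decimals, the result agrees with the reported 0.294 only approximately; the tail probability is well above conventional significance levels, which is the sense in which the fit is satisfactory.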

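From the log-linear representation (1) or the graphical
presentation in Figure 1, we find that the log-odds or logits
of the use of EDP for large hospitals are given by the
parametric representation

$$
\begin{aligned}
\ln \frac{x_2^*(111)}{x_2^*(211)} &= \tau_1^i + \tau_{11}^{ij} + \tau_{11}^{ik} \\
\ln \frac{x_2^*(112)}{x_2^*(212)} &= \tau_1^i + \tau_{11}^{ij} \\
\ln \frac{x_2^*(121)}{x_2^*(221)} &= \tau_1^i + \tau_{11}^{ik} \\
\ln \frac{x_2^*(122)}{x_2^*(222)} &= \tau_1^i
\end{aligned}
\qquad (2)
$$

where the values of the parameters for the estimate $x_2^*(ijk)$
are found to be

$\tau_1^i = -1.4842$, $\tau_{11}^{ij} = 0.5113$, $\tau_{11}^{ik} = 1.5103$.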

From (2) we also see that for the large hospitals

$$
\tau_{11}^{ij} = \ln \frac{x_2^*(111)\,x_2^*(221)}{x_2^*(211)\,x_2^*(121)} = \ln \frac{x_2^*(112)\,x_2^*(222)}{x_2^*(212)\,x_2^*(122)} = 0.5113,
$$

that is, the association between usage and location is the
same for either short or long stay. Similarly

$$
\tau_{11}^{ik} = \ln \frac{x_2^*(111)\,x_2^*(212)}{x_2^*(211)\,x_2^*(112)} = \ln \frac{x_2^*(121)\,x_2^*(222)}{x_2^*(221)\,x_2^*(122)} = 1.5103,
$$

that is, the association between usage and stay is the same
for either urban or rural location.

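For the small hospitals the log-odds or logits are

$$
\begin{aligned}
\ln \frac{y_2^*(111)}{y_2^*(211)} &= \tau_1^i + \tau_{11}^{ij} + \tau_{11}^{ik} \\
\ln \frac{y_2^*(112)}{y_2^*(212)} &= \tau_1^i + \tau_{11}^{ij} \\
\ln \frac{y_2^*(121)}{y_2^*(221)} &= \tau_1^i + \tau_{11}^{ik} \\
\ln \frac{y_2^*(122)}{y_2^*(222)} &= \tau_1^i
\end{aligned}
$$

where the values of the parameters for the estimate $y_2^*(ijk)$
are found to be

$\tau_1^i = -3.3357$, $\tau_{11}^{ij} = 1.3088$, $\tau_{11}^{ik} = 0.9836$.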


For the small hospitals we also have

$$
\tau_{11}^{ij} = \ln \frac{y_2^*(111)\,y_2^*(221)}{y_2^*(211)\,y_2^*(121)} = \ln \frac{y_2^*(112)\,y_2^*(222)}{y_2^*(212)\,y_2^*(122)} = 1.3088,
$$

that is, the association between usage and location is the
same for either short or long stay. Similarly

$$
\tau_{11}^{ik} = \ln \frac{y_2^*(111)\,y_2^*(212)}{y_2^*(211)\,y_2^*(112)} = \ln \frac{y_2^*(121)\,y_2^*(222)}{y_2^*(221)\,y_2^*(122)} = 0.9836,
$$

that is, the association between usage and stay is the same
for either urban or rural location.
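These log cross-product ratios can be recomputed directly from the fitted table. The following sketch is illustrative only; the helper name `log_cross_product` is hypothetical, and the fitted values are those of Table 4 with the value for cell (122) taken as the 0.137 quoted in the text.

```python
import math

# Fitted small-hospital values, keyed by (i, j, k).
y_star = {(1, 1, 1): 28.137, (1, 1, 2): 1.863, (1, 2, 1): 10.863, (1, 2, 2): 0.137,
          (2, 1, 1): 79.863, (2, 1, 2): 14.137, (2, 2, 1): 114.137, (2, 2, 2): 3.863}

def log_cross_product(t, a, b, c, d):
    """ln[t(a) t(d) / (t(b) t(c))] for four cells a, b, c, d of a table."""
    return math.log(t[a] * t[d] / (t[b] * t[c]))

# Usage by location, at short stay (k = 1) and at long stay (k = 2).
ij_short = log_cross_product(y_star, (1, 1, 1), (2, 1, 1), (1, 2, 1), (2, 2, 1))
ij_long = log_cross_product(y_star, (1, 1, 2), (2, 1, 2), (1, 2, 2), (2, 2, 2))

# Usage by stay, at urban (j = 1) and at rural (j = 2) location.
ik_urban = log_cross_product(y_star, (1, 1, 1), (2, 1, 1), (1, 1, 2), (2, 1, 2))
ik_rural = log_cross_product(y_star, (1, 2, 1), (2, 2, 1), (1, 2, 2), (2, 2, 2))

print(round(ij_short, 4), round(ij_long, 4))
print(round(ik_urban, 4), round(ik_rural, 4))
```

The two values in each pair agree only up to the three-decimal rounding of the table entries; the small fitted cell 0.137 is the most sensitive to that rounding.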

Since the data for the large hospitals reflect observations
over all such hospitals, it will be of interest to
determine whether there exists a suitable estimate for the
small hospitals, other than $y_2^*(ijk)$, which will have some of
its interactions (associations) the same as the corresponding
values for the large hospitals. This can be accomplished by
using the iterative algorithm fitting various subsets of
marginals of $y_2^*(ijk)$ (or the original y(ijk)) but starting
with a distribution which has the same tau parameters as
$x_2^*(ijk)$. The tau parameters of $x_2^*(ijk)$ not affected by the
iterative fitting procedure will be "inherited" by the resultant
estimate. We shall use the table $v(ijk) = (253/923)\,x_2^*(ijk)$,
which has the same tau parameters as the $x_2^*(ijk)$ table with
total adjusted to be the same as the observed total of small
hospitals.
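The reason a tau parameter can be "inherited" is that each fitting step multiplies the current table by factors that depend only on the indices of the marginal being fitted. Within a 2x2 slice, such factors act as row or column multipliers and leave the log cross-product ratio unchanged. The sketch below checks this numerically; the table and the multipliers are arbitrary values chosen for illustration, and the function names are hypothetical.

```python
import math

def log_cross_product(t):
    """ln[t11 t22 / (t12 t21)] for a 2x2 table given as nested lists."""
    return math.log(t[0][0] * t[1][1] / (t[0][1] * t[1][0]))

def rescale(t, row_factors, col_factors):
    """Multiply each cell by a row factor and a column factor, as a
    marginal-fitting step does within a 2x2 slice of the table."""
    return [[t[r][c] * row_factors[r] * col_factors[c] for c in range(2)]
            for r in range(2)]

start = [[4.0, 1.0], [2.0, 3.0]]                  # arbitrary positive table
adjusted = rescale(start, [0.3, 5.0], [2.0, 0.7])  # arbitrary multipliers
shrunk = [[253 / 923 * value for value in row] for row in start]

print(log_cross_product(start))
print(log_cross_product(adjusted))
print(log_cross_product(shrunk))
```

All three printed values coincide up to floating-point error, which also shows why rescaling a table to a new total, as in the definition of v(ijk), leaves its association parameters untouched.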

We summarize the procedure, starting the iterative fitting
algorithm with v(ijk) (recall that y(ijk) and $y_2^*(ijk)$ have
the same two-way and one-way marginals):

      Marginals fitted           Estimate        Tau parameters "inherited" from v(ijk)

  a)  y(i.k), y(.jk)             $u_a^*(ijk)$    $\tau_{11}^{ij}$
  b)  y(ij.), y(.jk)             $u_b^*(ijk)$    $\tau_{11}^{ik}$
  c)  y(ij.), y(i.k)             $u_c^*(ijk)$    $\tau_{11}^{jk}$
  d)  y(.jk), y(i..)             $u_d^*(ijk)$    $\tau_{11}^{ij}$, $\tau_{11}^{ik}$
  e)  y(i.k), y(.j.)             $u_e^*(ijk)$    $\tau_{11}^{ij}$, $\tau_{11}^{jk}$
  f)  y(ij.), y(..k)             $u_f^*(ijk)$    $\tau_{11}^{ik}$, $\tau_{11}^{jk}$
  g)  y(i..), y(.j.), y(..k)     $u_g^*(ijk)$    $\tau_{11}^{ij}$, $\tau_{11}^{ik}$, $\tau_{11}^{jk}$


In order to test whether the $u_m^*$ estimates differ
significantly from the $y_2^*$ estimates, that is, whether the
interaction parameters in $y_2^*$ differ significantly from the
interaction parameters in $u_m^*$ "inherited" from $x_2^*$ or v,
we compute the statistic

$$
2I(y_2^* : u_m^*) = 2 \sum_i \sum_j \sum_k y_2^*(ijk) \ln \frac{y_2^*(ijk)}{u_m^*(ijk)}
$$

which is asymptotically distributed as $\chi^2$ with 1 D.F. for
m=a,b,c, 2 D.F. for m=d,e,f, 3 D.F. for m=g.

The only case which yielded a non-significant value was
$u_b^*(ijk)$, for which

$$
2I(y_2^* : u_b^*) = 0.408, \quad 1 \text{ D.F.}
$$

The values of $u_b^*(ijk)$ are given in Table 5.
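The fit for case (b) can be reproduced approximately with a short program. The sketch below is an illustration under stated assumptions and not the authors' code. Instead of the full table v(ijk), it starts from a table whose only association is the usage-by-stay value 1.5103 quoted for the large hospitals; this relies on the reasoning that every other factor of v(ijk) depends on (i, j) or (j, k) alone and is therefore absorbed when y(ij.) and y(.jk) are fitted. The function names, tolerance, and cycle limit are hypothetical choices.

```python
import math
from itertools import product

cells = list(product((1, 2), repeat=3))  # (i, j, k) in the order 111, 112, ..., 222

# Small hospitals: observed counts (Table 2) and smoothed values (Table 4).
y = dict(zip(cells, [28, 2, 11, 0, 80, 14, 114, 4]))
y_star = dict(zip(cells, [28.137, 1.863, 10.863, 0.137,
                          79.863, 14.137, 114.137, 3.863]))

# Assumed stand-in for v(ijk): only the usage-by-stay association is built in.
TAU_IK_LARGE = 1.5103
start = {c: math.exp(TAU_IK_LARGE) if c[0] == 1 and c[2] == 1 else 1.0
         for c in cells}

def margin(table, axes):
    """Sum a table over every index position not listed in `axes`."""
    out = {}
    for cell, value in table.items():
        key = tuple(cell[a] for a in axes)
        out[key] = out.get(key, 0.0) + value
    return out

def fit_from(start, observed, margins, tol=1e-10, max_cycles=10000):
    """Rescale `start` to each listed marginal of `observed` until stable."""
    est = dict(start)
    targets = [margin(observed, axes) for axes in margins]
    for _ in range(max_cycles):
        largest_change = 0.0
        for axes, target in zip(margins, targets):
            current = margin(est, axes)
            for cell in est:
                key = tuple(cell[a] for a in axes)
                new = est[cell] * target[key] / current[key]
                largest_change = max(largest_change, abs(new - est[cell]))
                est[cell] = new
        if largest_change < tol:
            break
    return est

u_b = fit_from(start, y, [(0, 1), (1, 2)])  # case (b): fit y(ij.) and y(.jk)

statistic = 2.0 * sum(y_star[c] * math.log(y_star[c] / u_b[c]) for c in cells)
tau_ik = math.log(u_b[(1, 2, 1)] * u_b[(2, 2, 2)]
                  / (u_b[(2, 2, 1)] * u_b[(1, 2, 2)]))
print(round(statistic, 3), round(tau_ik, 4))
```

The statistic should land close to the reported 0.408 (the rounding of the smoothed values limits the agreement), and the usage-by-stay association of the fitted table stays at the starting value 1.5103.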

The log-linear representation for $u_b^*(ijk)$ in terms of
v(ijk) is

$$
\ln \frac{u_b^*(ijk)}{v(ijk)} = L + \tau_1^i T_1^i(ijk) + \tau_1^j T_1^j(ijk) + \tau_1^k T_1^k(ijk) + \tau_{11}^{ij} T_{11}^{ij}(ijk) + \tau_{11}^{jk} T_{11}^{jk}(ijk) \qquad (3)
$$


Note that $\tau_{11}^{ik}$ does not appear explicitly in (3). By using
the log-linear representation for v(ijk) itself we also get
the reparametrization or log-linear representation for $u_b^*(ijk)$
in terms of the uniform distribution

$$
\ln \frac{u_b^*(ijk)}{n\pi(ijk)} = L + \tau_1^i T_1^i(ijk) + \tau_1^j T_1^j(ijk) + \tau_1^k T_1^k(ijk) + \tau_{11}^{ij} T_{11}^{ij}(ijk) + \tau_{11}^{ik} T_{11}^{ik}(ijk) + \tau_{11}^{jk} T_{11}^{jk}(ijk) \qquad (4)
$$


We remark that the numerical values of the taus in (3)
and (4) are not the same.


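The log-odds or logits of the use of EDP for small
hospitals may now be given by the parametric representation

$$
\begin{aligned}
\ln \frac{u_b^*(111)}{u_b^*(211)} &= \tau_1^i + \tau_{11}^{ij} + \tau_{11}^{ik} \\
\ln \frac{u_b^*(112)}{u_b^*(212)} &= \tau_1^i + \tau_{11}^{ij} \\
\ln \frac{u_b^*(121)}{u_b^*(221)} &= \tau_1^i + \tau_{11}^{ik} \\
\ln \frac{u_b^*(122)}{u_b^*(222)} &= \tau_1^i
\end{aligned}
\qquad (5)
$$

where the values of the parameters in (5) are
$\tau_1^i = -3.8569$, $\tau_{11}^{ij} = 1.3354$, $\tau_{11}^{ik} = 1.5103$.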

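For the small hospitals we now have the associations

$$
\tau_{11}^{ij} = \ln \frac{u_b^*(111)\,u_b^*(221)}{u_b^*(211)\,u_b^*(121)} = \ln \frac{u_b^*(112)\,u_b^*(222)}{u_b^*(212)\,u_b^*(122)} = 1.3354
$$

and

$$
\tau_{11}^{ik} = \ln \frac{u_b^*(111)\,u_b^*(212)}{u_b^*(211)\,u_b^*(112)} = \ln \frac{u_b^*(121)\,u_b^*(222)}{u_b^*(221)\,u_b^*(122)} = 1.5103.
$$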


Note that $\tau_{11}^{ij}$, the association between usage and
location, is still different for the small hospitals from that
for the large hospitals, but that the association between usage
and stay, $\tau_{11}^{ik}$, is now the same for both large and
small hospitals.

Arranging the log-odds of usage in descending order of
magnitude within the large hospitals and within the small
hospitals we find

Large hospitals                                   Factors         Small hospitals

$\ln \frac{x_2^*(111)}{x_2^*(211)} = 0.5374$      Urban, Short    $\ln \frac{u_b^*(111)}{u_b^*(211)} = -1.0111$
$\ln \frac{x_2^*(121)}{x_2^*(221)} = 0.0262$      Rural, Short    $\ln \frac{u_b^*(121)}{u_b^*(221)} = -2.3466$
$\ln \frac{x_2^*(112)}{x_2^*(212)} = -0.9729$     Urban, Long     $\ln \frac{u_b^*(112)}{u_b^*(212)} = -2.5214$
$\ln \frac{x_2^*(122)}{x_2^*(222)} = -1.4842$     Rural, Long     $\ln \frac{u_b^*(122)}{u_b^*(222)} = -3.8569$
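The tabulated log-odds follow from the parameter values quoted in the text by simple addition, and the two orderings can be compared programmatically. The sketch below is an illustration; the dictionary layout and the function names `logits` and `ranking` are hypothetical.

```python
# Parameter values quoted in the text for the large hospitals (from the
# two-way-marginal fit) and for the small hospitals (from representation (5)).
large = {"tau_i": -1.4842, "tau_ij": 0.5113, "tau_ik": 1.5103}
small = {"tau_i": -3.8569, "tau_ij": 1.3354, "tau_ik": 1.5103}

def logits(p):
    """Log-odds of use for the four location/stay combinations."""
    return {
        "Urban, Short": p["tau_i"] + p["tau_ij"] + p["tau_ik"],
        "Rural, Short": p["tau_i"] + p["tau_ik"],
        "Urban, Long": p["tau_i"] + p["tau_ij"],
        "Rural, Long": p["tau_i"],
    }

def ranking(values):
    """Factor combinations sorted from highest to lowest log-odds."""
    return sorted(values, key=values.get, reverse=True)

large_logits, small_logits = logits(large), logits(small)
same_order = ranking(large_logits) == ranking(small_logits)
overlap = max(small_logits.values()) > min(large_logits.values())

for factors in ranking(large_logits):
    print(factors, round(large_logits[factors], 4), round(small_logits[factors], 4))
print(same_order, overlap)
```

The sums match the tabulated values to within rounding in the last decimal. Both size groups rank the four combinations in the same order, and the highest log-odds among the small hospitals exceeds the lowest among the large hospitals.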


Conclusion 


There are many instances in which a researcher needs some
technique that will allow him to take into consideration the
interactions of many variables, particularly qualitative, that
do not meet the stringent assumptions underlying parametric
statistical testing procedures. Contingency table analysis
based upon the minimum discrimination information technique
is a tool that is available to fill this need. We have seen
that application of this technique to certain types of
problems mentioned in [2] is indeed feasible. We have
illustrated the use of this technique by showing that
innovation in hospitals, as indicated by the adoption of EDP, is
significantly associated with location and length of stay,
the latter association being the same for both large and
small hospitals. Furthermore, innovativeness is most
pronounced for large hospitals with short stay and least for
small hospitals with long stay.

Table 1

Large Hospitals x(ijk)

                   Urban              Rural
               Short    Long      Short    Long
User             376      40         52      15
Non-user         217     112         54      57
Total            593     152        106      72

Table 2

Small Hospitals y(ijk)

                   Urban              Rural
               Short    Long      Short    Long
User              28       2         11       0
Non-user          80      14        114       4
Total            108      16        125       4

Table 3

Large Hospitals $x_2^*(ijk)$

                     Urban                       Rural
               Short        Long           Short        Long          Total
User        [illegible]  [illegible]    [illegible]  [illegible]    483.000
Non-user      218.693      110.308         52.307    [illegible]    440.000
Total                                                               923.000

Table 4

Small Hospitals $y_2^*(ijk)$

                   Urban                Rural
               Short     Long       Short     Long       Total
User          28.137    1.863      10.863    0.137      41.000
Non-user      79.863   14.137     114.137    3.863     212.000
Total        108.000   16.000     125.000    4.000     253.000